Actually, a fairly large selection of statistical methods can be listed under the title “dimensionality reduction techniques”. Most often (nearly always, that is!) the real-world phenomena are multidimensional: they may consist of not just two or three but 5 or 10 or 20 or 50 (or more) dimensions. Of course, we are living only in a three-dimensional (3D) world, so those multiple dimensions may really challenge our imagination. It would be easier to reduce the number of dimensions in one way or another.
We shall now learn the basics of two data science based ways of reducing the dimensions. The principal method here is principal component analysis (PCA), which reduces any number of measured (continuous) and correlated variables into a few uncorrelated components that collect together as much variance as possible from the original variables. The most important components can then be used for various purposes, e.g., drawing scatterplots and other fancy graphs that would be quite impossible to achieve with the original variables and their many dimensions.
Multiple correspondence analysis (MCA) and other variations of CA bring us similar possibilities in the world of discrete variables, even nominal scale (classified) variables, by finding a suitable transformation into continuous scales and then reducing the dimensions quite analogously with the PCA. The typical graphs show the original classes of the discrete variables on the same “map”, making it possible to reveal connections (correspondences) between different things that would be quite impossible to see from the corresponding cross tables (too many numbers!).
Briefly stated, these methods help to visualize and understand multidimensional phenomena by reducing their dimensionality, which may at first feel impossible to handle at all.
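As a minimal sketch of the PCA idea before the actual analysis (on a small simulated dataset, not the assignment data): two correlated variables are rotated into uncorrelated components, with the first component absorbing most of the variance.

```r
set.seed(1)
x <- rnorm(100)
# two strongly correlated variables
toy <- data.frame(a = x, b = x + rnorm(100, sd = 0.3))
pca_toy <- prcomp(toy, scale. = TRUE)
summary(pca_toy)          # PC1 captures most of the variance
round(cor(pca_toy$x), 2)  # the components themselves are uncorrelated
```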
library(readr)
library(tibble)
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(corrplot)
## corrplot 0.92 loaded
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(FactoMineR)
human0 <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv", show_col_types = FALSE)
Move the country names to rownames (see Exercise 5.5). Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)

human <- column_to_rownames(human0, "Country")
Let’s look at the data, starting with some descriptive statistics:
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Based on the difference between mean and median, and the ranges of
the variables, Edu.Exp and Parli.F seem the
most normally distributed. Edu2.FM, Labo.FM
and Life.Exp are more skewed or have longer tails.
GNI and Mat.Mor are closer to log-normal (or a
related distribution), and maybe Ado.Birth as well.
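The mean-median comparison above can be computed directly (a quick check, assuming the `human` data frame from above):

```r
# Mean minus median, divided by the standard deviation so the variables
# are comparable. Positive values suggest right skew; values near zero
# suggest a more symmetric distribution.
round(sapply(human, function(v) (mean(v) - median(v)) / sd(v)), 2)
```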
corrplot(cor(human), method = "circle", type = "upper", tl.pos = "d")
fig. 1.1, correlation matrix of variables
There are definitely strong correlations in the data. We can identify four different sets of variables from this matrix.
First, there is one set of very strongly correlated
variables: Mat.Mor, Ado.Birth,
Life.Exp, and Edu.Exp. These will likely be
very strongly represented in the first principal component of a PCA.
Second, these four variables also correlate very strongly with
GNI and Edu2.FM, more strongly in fact than GNI and Edu2.FM
correlate with each other. Since that pair correlate less
between themselves, there seem to be more degrees of freedom here.
Then there is a third group, Labo.FM and
Parli.F. They seem much less correlated compared to the
other set of variables, however, they are most strongly correlated with
each other. Again, seemingly more degrees of freedom in this group.
ggpairs(human, progress = FALSE)
fig. 1.2, distributions and scatterplots of variables
The correlations become much more clear when looking at the
scatterplots. Some are clearly linear, like Edu.Exp and
Life.Exp. It’s even clearer that GNI might
need to be log-transformed, based on the distribution and the scatter
plot shapes. Possibly Mat.Mor and Ado.Birth
too.
The other distributions look like skewed, roughly normal distributions; all of them have one main mode, although there are some interesting concentrations in certain parts of the distributions.
E.g., in Life.Exp, there is a clear plateau from 60 to
roughly 65 or so. This may be related to quality of life before and
after retirement.
Another example is this curious smaller mode in the
Edu2.FM around 0.5, and a subsequent drop after that. This
suggests that there are two overlaid populations, one where the mode is
a little under 1, another where the mode is around 0.5. I don’t have a
hypothesis for what could be causing this difference in behaviors in the
two populations of countries, but it’s a very interesting feature of the
distribution. Unfortunately we’ve already thrown away the two variables
that this variable is based on, which may have given us some clues to
what this effect is.
human.log <- human %>% mutate(GNI = log(GNI), Mat.Mor = log(Mat.Mor), Ado.Birth = log(Ado.Birth))
ggpairs(human.log, progress = FALSE)
fig. 1.3, distributions and scatterplots of variables after log transform
The scatter plots now seem to form much clearer trend lines, which seems to indicate this was a good idea. Let’s recheck the correlation matrix, just to check that the scale of the data didn’t mix things up too much.
corrplot(cor(human.log), method = "circle", type = "upper", tl.pos = "d")
fig. 1.4, correlation matrix using log transformed dataset
Well, this does seem to have changed a lot of things. The
correlations are now much more even among five variables:
Life.Exp, Edu.Exp, GNI,
Mat.Mor, and Ado.Birth. The correlation with
Edu2.FM is a little weaker, and the Parli.F
and Labo.FM variables are again least correlated with the
others.
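To see exactly which correlations the log transform changed, one can difference the two correlation matrices (a quick check, assuming `human` and `human.log` from above):

```r
# Positive entries mean the correlation strengthened after the log
# transform, negative entries mean it weakened.
round(cor(human.log) - cor(human), 2)
```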
Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)

Since the log transform was not part of the assignment, let’s forget it for now, and do PCA on the raw dataset and look at the coefficients.
pca_raw <- prcomp(human)
pca_raw
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03 0.0022190154
## Life.Exp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02 0.9865644425
## Edu.Exp -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02 0.1431180282
## GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05 -0.0001135863
## Mat.Mor 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03 0.0266373214
## Ado.Birth 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03 0.0188618600
## Parli.F -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
## PC6 PC7 PC8
## Edu2.FM 2.180183e-02 6.998623e-01 7.139410e-01
## Labo.FM 3.264423e-02 7.132267e-01 -7.001533e-01
## Life.Exp -1.453515e-01 5.380452e-03 2.281723e-03
## Edu.Exp 9.882477e-01 -3.826887e-02 7.776451e-03
## GNI -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor 1.695203e-03 1.355518e-04 8.371934e-04
## Ado.Birth 1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F -2.309896e-02 -2.642548e-03 2.680113e-03
PC1 is nearly all GNI, PC2 is nearly all Mat.Mor, and PC3 is nearly
all Ado.Birth, etc. This is a problem. Let’s see why by looking at the
summary of the transformed rows, which are in
pca_raw$x.
summary(pca_raw$x)
## PC1 PC2 PC3 PC4
## Min. :-105495 Min. :-858.85 Min. :-87.306 Min. :-37.798
## 1st Qu.: -6885 1st Qu.: -75.30 1st Qu.:-11.498 1st Qu.: -6.643
## Median : 5587 Median : 60.76 Median : 2.681 Median : 1.787
## Mean : 0 Mean : 0.00 Mean : 0.000 Mean : 0.000
## 3rd Qu.: 13430 3rd Qu.: 117.75 3rd Qu.: 13.592 3rd Qu.: 8.200
## Max. : 17051 Max. : 200.85 Max. : 99.552 Max. : 26.267
## PC5 PC6 PC7 PC8
## Min. :-16.0452 Min. :-4.38647 Min. :-0.70772 Min. :-0.43774
## 1st Qu.: -2.0796 1st Qu.:-1.10824 1st Qu.:-0.08793 1st Qu.:-0.08919
## Median : 0.5794 Median : 0.07198 Median : 0.02775 Median :-0.01657
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 2.3562 3rd Qu.: 1.09716 3rd Qu.: 0.10801 3rd Qu.: 0.08364
## Max. : 7.6214 Max. : 3.80771 Max. : 0.80521 Max. : 0.50052
As we saw earlier, GNI per capita runs from roughly 500 to over 100000, while the other variables max out in the hundreds at most. Since PCA maximizes the spread along each principal component, the decomposition is dominated by the variables with the largest scales; the distance scales of these variables are not comparable, so the dataset needs to be standardized.
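The scale problem can be made concrete by looking at the per-variable variances, which is exactly what unstandardized PCA decomposes (assuming the `human` data frame from above):

```r
# GNI's variance is several orders of magnitude larger than the others,
# so it dominates the total variance that PCA distributes over components.
sort(sapply(human, var), decreasing = TRUE)
```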
If we look at a summary of the pca, this is even more clear:
summary(pca_raw)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
This tells us that PC1 captures nearly all of the variance (>0.9999) in the dataset, which again is due to the dataset not being standardized.
Knowing this, we can’t expect a very good biplot, but let’s plot one anyway.
pca_pr <- round(100*summary(pca_raw)$importance[2, ], digits = 4)
labels <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_raw, cex = c(0.8, 1), col = c("grey40", "#dd4028"), xlab = paste("Negative GNI,", labels[1]), ylab = paste("Maternal mortality,", labels[2]))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
fig. 2.1, biplot for PCA of the raw data
Again, we see that PC1 is pretty much only GNI. This is evident from the fact that the GNI arrow is the only visible one. This can also be inferred from looking at the order of the countries, high GNI countries are to the left, low GNI countries to the right.
I’ve named the principal components according to the variable that they’ve picked up from the dataset (negative GNI and Maternal Mortality), since each of the first few components are almost aligned with one dimension.
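The one-variable-per-component pattern can also be checked programmatically (a quick sketch, assuming `pca_raw` from above): for each principal component, find the original variable with the largest absolute loading.

```r
# Returns, for each PC, the name of the variable that dominates it.
apply(abs(pca_raw$rotation), 2, function(w) names(which.max(w)))
```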
Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analyses (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)

Let’s standardize, do a PCA, and look at the principal components.
pca_scaled <- human %>% scale %>% as.data.frame %>% prcomp
pca_scaled
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.35664370 0.03796058 -0.24223089 0.62678110 0.5983585
## Labo.FM 0.05457785 0.72432726 -0.58428770 0.06199424 -0.2625067
## Life.Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 -0.1628935
## Edu.Exp -0.42766720 0.13940571 -0.07340270 -0.07020294 -0.1659678
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 -0.25170614 0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 0.04986763 0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 0.01396293 0.1523699
## PC6 PC7 PC8
## Edu2.FM -0.17713316 0.05773644 0.16459453
## Labo.FM 0.03500707 -0.22729927 -0.07304568
## Life.Exp 0.42242796 -0.43406432 0.62737008
## Edu.Exp 0.38606919 0.77962966 -0.05415984
## GNI -0.11120305 -0.13711838 -0.16961173
## Mat.Mor -0.17370039 0.35380306 0.72193946
## Ado.Birth 0.76056557 -0.06897064 -0.14335186
## Parli.F -0.13749772 0.00568387 -0.02306476
This already looks much better. Let’s see the spreads.
summary(pca_scaled$x)
## PC1 PC2 PC3 PC4
## Min. :-3.4128 Min. :-2.79379 Min. :-2.13797 Min. :-3.43435
## 1st Qu.:-1.4651 1st Qu.:-0.61041 1st Qu.:-0.61625 1st Qu.:-0.51644
## Median :-0.4934 Median : 0.04166 Median :-0.07731 Median : 0.05376
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 1.0237 3rd Qu.: 0.68895 3rd Qu.: 0.47541 3rd Qu.: 0.54936
## Max. : 6.0717 Max. : 3.12552 Max. : 2.56743 Max. : 2.27574
## PC5 PC6 PC7 PC8
## Min. :-1.54550 Min. :-1.82764 Min. :-0.98931 Min. :-1.0576
## 1st Qu.:-0.41848 1st Qu.:-0.26156 1st Qu.:-0.29823 1st Qu.:-0.1581
## Median :-0.06288 Median :-0.02236 Median :-0.02144 Median : 0.0261
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.33133 3rd Qu.: 0.34286 3rd Qu.: 0.29026 3rd Qu.: 0.1657
## Max. : 2.74409 Max. : 1.49002 Max. : 1.13872 Max. : 1.3844
Ok, the range of the dimensions seem much more sensible now, they are all in the same order of magnitude. Let’s see how much variance each principal component explains.
summary(pca_scaled)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
PC1 explains more than half the variance, not bad! However, every principal component captures at least one percent of the variance, so we will be losing some information wherever we decide to cut off.
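The cutoff trade-off can be read off the cumulative variance directly (assuming `pca_scaled` from above):

```r
# Squared standard deviations are the variances of the components;
# the cumulative proportion shows how much we keep at each cutoff
# (per the summary above, the first two components keep about 70%).
vars <- pca_scaled$sdev^2
round(cumsum(vars) / sum(vars), 3)
```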
pca_pr <- round(100*summary(pca_scaled)$importance[2, ], digits = 4)
labels <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_scaled, cex = c(0.8, 1), col = c("grey40", "#dd4028"), xlab = paste("neg. Human development,", labels[1]), ylab = paste("Gender equality,", labels[2]))
fig. 3.1, biplot for PCA of the scaled data
Much better, and better yet, all original variables are roughly axis-aligned. I’ve added descriptive labels based on which variables align with which axes, more on these in section 4.
As already discussed, the raw data was a bad fit for PCA due to the different orders of magnitude in the dispersion among the variables. PCA on raw data essentially just picked out one variable at a time in decreasing order of scale.
PCA on the scaled data performs much better.
I’ve decided to name PC1 neg. Human development,
inspired by the name of the original dataset. This principal component
measures the welfare of the country in terms of health (Mat.Mor,
Ado.Birth, Life.Exp), standard of living (GNI per cap., ppp adjusted),
and education (Edu.Exp, Edu2.FM). The value of this component is smaller
with better outcomes in these domains, which is where the
neg. comes in. I would rather take the negative of this PC1
and call it Human development, but this is what the PCA
gave us.
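Taking the negative of PC1 as a crude “Human development” score is a one-liner (a sketch, assuming `pca_scaled` from above; the ranking, not the sign convention, is what matters):

```r
hdi_like <- -pca_scaled$x[, "PC1"]        # flip so that larger = more developed
head(sort(hdi_like, decreasing = TRUE))   # countries with the highest scores
```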
PC2 I’ve called gender equality, because it measures female participation in political decision-making (0.65 * Parli.F) and female participation in the labour market (0.72 * Labo.FM).
This is perhaps not the whole story, because this component has a positive contribution from maternal mortality rates and adolescent birth. Maybe this component only measures whether society has moved away from traditional gender roles. In that case, this can be seen as a cultural liberal/conservative axis.
PC3 is positively correlated with female participation in political decision-making (0.73 * Parli.F), but negatively correlated with the ratio of female participation in labor markets and secondary education (-0.584 * Labo.FM, -0.24 * Edu2.FM).
Let’s flip this around and consider NPC3 = -PC3, it’s
easier to reason about that way. If a country has a high level of female
MPs relative to the female education and participation in the labor
market, then NPC3 is high.
So this principal component measures how women are viewed as in society. Are they seen as leaders (and elected into parliament)? then PC3 is low. Are they seen as useful in the workforce but not as leaders? then PC3 is high.
The principal components are getting harder and harder to reason about as we go further down the list.
Roughly,
PC4 = 0.62 Edu2.FM - 0.72 GNI - 0.25 Mat.Mor.
These variables don’t seem to make a clear story. Edu2.FM should be very close to 1 for all developed nations, as high school dropouts are rare, and with negative GNI per cap. maybe this axis is about the economic development of the country? The maternal mortality rate is difficult to square with this.
This component might measure the relative focus on wealth as compared to other societal welfare in the country. With heavy emphasis on might.
Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points)

I already touched on this in the previous section, but as the first
principal component aligns with Mat.Mor,
Ado.Birth, Life.Exp, GNI,
Edu.Exp, and Edu2.FM, it seems to capture the
general wellbeing of society. Or rather, the lack thereof, since it’s
pointing in the other direction.
The second axis measures gender equality, as it aligns well with
Parli.F and Labo.FM.
There are some additional interesting correlations in the original
variables with the PCs. Mat.Mor and Edu.Exp
both point slightly in the positive PC2 direction. This doesn’t make a
lot of intuitive sense if we consider PC2 to capture gender equality or
liberal values. However, Mat.Mor is pointing in the
opposite direction from the social equality trend line, which may
suggest that the principal components aren’t aligned in any
particularly meaningful direction. PCA has maximized the dispersion
along PC1 and PC2, but that does not necessarily mean the axes are
meaningful in themselves; they may be slightly misaligned compared
to the labels I’ve given them.
This idea is supported by the fact that PC1 is bimodal, so the second, smaller mode might then be pulling the PCA slightly off the trend line. The other mode consists mostly of African countries and other developing nations. The fact that the other variables are distributed differently here might be due to artifacts of colonialism or some other big systemic differences between developed and developing nations.
ggplot(data.frame(pca_scaled$x), aes(x=PC1)) + geom_density()
fig. 3.2, PC1 distribution, clearly bimodal
Parli.F and Labo.FM are pointing in
slightly different directions from each other, with
Parli.F slightly aligned with GNI, and
Labo.FM roughly orthogonal to it. This suggests that
female labor participation shows no clear association with GNI (it
doesn’t matter for the economy what the sex of a worker is), while
female political leadership has a positive correlation with GNI (of
course, we don’t have a causal link here, only a correlation).
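These alignment claims can be quantified by correlating the original (scaled) variables with the component scores (a quick check, assuming `human` and `pca_scaled` from above):

```r
# Rows: original variables; columns: the first two principal components.
# Entries near zero indicate a variable orthogonal to that component.
round(cor(scale(human), pca_scaled$x[, 1:2]), 2)
```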
Just for fun, let’s use the log transformed data to see if some of those correlations change.
pca_log_scaled <- human.log %>% scale %>% as.data.frame %>% prcomp
summary(pca_log_scaled)
biplot(pca_log_scaled, cex = c(0.8, 1), col = c("grey40", "#dd4028"))
fig. 4.1, biplot for PCA of the standardized and log transformed data
Sure enough, if we look at the summary, we see that the first PC now captures 57% of the variance compared to 53.6% before, a better result. The first four PCs now capture 90% of the variance, after that we get diminishing returns.
pca_log_scaled
## Standard deviations (1, .., p=8):
## [1] 2.1505310 1.1259172 0.8681023 0.7727862 0.5558253 0.4410956 0.3827107
## [8] 0.3267292
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.32850815 0.01067147 -0.35894548 0.77560185 -0.36060516
## Labo.FM 0.03263713 0.72942195 -0.61074433 -0.23458714 0.05083415
## Life.Exp -0.42542035 -0.05622607 0.11141581 -0.03806539 0.26227538
## Edu.Exp -0.42035073 0.09697456 -0.06818573 -0.03839701 0.44933876
## GNI -0.43314660 -0.09480875 -0.01735880 0.06836804 0.13860792
## Mat.Mor 0.43672168 0.01162213 -0.02391875 0.21779671 -0.12263401
## Ado.Birth 0.38284906 0.03041796 -0.03340255 0.48515646 0.74256569
## Parli.F -0.09178670 0.66724454 0.69216873 0.23021937 -0.10502880
## PC6 PC7 PC8
## Edu2.FM 0.05763161 0.12121944 0.11626173
## Labo.FM 0.12840535 -0.11338286 -0.08313215
## Life.Exp 0.73375796 0.22248569 -0.38118853
## Edu.Exp -0.56661984 0.53272731 -0.03187199
## GNI -0.27117249 -0.74980849 -0.37876162
## Mat.Mor -0.17790440 0.22051498 -0.81597528
## Ado.Birth 0.11877107 -0.16353812 0.15412247
## Parli.F -0.03795809 -0.03959254 0.01489879
Looking closer at the coefficients, PC1, PC2, and PC3 have roughly the same interpretation as before. PC4 has changed now, but it is still as difficult to interpret as before.
The tea data comes from the FactoMineR package and was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions), how they perceive the product (12 questions), and some personal details (4 questions).
Load the tea dataset and convert its character variables to factors:
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data. Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
dim(tea)
## [1] 300 36
#View(tea)
The dataset consists of 300 rows of 36 answers to the questionnaire; each row represents one respondent.
Summary statistics follow.
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 +60 :38 +2/day :127
## middle :40 sportsman :179 15-24:92 1 to 2/week: 44
## non-worker :64 25-34:69 1/day : 95
## other worker:20 35-44:40 3 to 6/week: 34
## senior :35 45-59:61
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
The FactoMineR documentation (https://rdrr.io/cran/FactoMineR/man/tea.html) says the first 18 variables are about how the respondents drink tea, variable 19 is age, and the rest are personal questions and the “product’s perception”, which I’m choosing to interpret as how they think about tea.
There are a lot of variables here, so let’s choose a sensible subset of them to look at. Let’s choose the following variables, which I’m interpreting free-hand here, since I can’t find much metadata:

- Tea: what kind of tea out of three types the respondent prefers (green, black, Earl Grey)
- price: the price level of the tea that the respondent prefers
- How: whether the respondent takes tea as is, with lemon, with milk, or in some other way
- sex: the sex of the respondent
- SPC: some kind of general social group for the respondent (student, employee (white collar?), middle (management?), senior)
- age_Q: the age group of the respondent
- frequency: how often the respondent drinks tea

Of these, Tea, price, and How are “active” variables, and the rest are “supplementary” variables.
As a sidenote, this dataset is very badly documented.
tea_filtered <- tea %>% dplyr::select(all_of(c("Tea", "price", "How", "sex", "SPC", "age_Q", "frequency")))
summary(tea_filtered)
## Tea price How sex SPC
## black : 74 p_branded : 95 alone:195 F:178 employee :59
## Earl Grey:193 p_cheap : 7 lemon: 33 M:122 middle :40
## green : 33 p_private label: 21 milk : 63 non-worker :64
## p_unknown : 12 other: 9 other worker:20
## p_upscale : 53 senior :35
## p_variable :112 student :70
## workman :12
## age_Q frequency
## +60 :38 +2/day :127
## 15-24:92 1 to 2/week: 44
## 25-34:69 1/day : 95
## 35-44:40 3 to 6/week: 34
## 45-59:61
##
##
Let’s look at a summary of the MCA of the dataset based on these specs above.
mca <- MCA(tea_filtered, quali.sup = 4:7, graph = F)
summary(mca)
##
## Call:
## MCA(X = tea_filtered, quali.sup = 4:7, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.423 0.399 0.383 0.355 0.335 0.330 0.316
## % of var. 12.688 11.967 11.497 10.653 10.043 9.907 9.484
## Cumulative % of var. 12.688 24.655 36.152 46.805 56.848 66.755 76.239
## Dim.8 Dim.9 Dim.10
## Variance 0.283 0.271 0.238
## % of var. 8.490 8.142 7.129
## Cumulative % of var. 84.729 92.871 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.720 0.409 0.056 | 1.341 1.503 0.196 | 0.789 0.541
## 2 | 0.782 0.481 0.216 | 0.593 0.294 0.124 | 0.071 0.004
## 3 | -0.462 0.168 0.231 | 0.110 0.010 0.013 | -0.707 0.435
## 4 | -0.462 0.168 0.231 | 0.110 0.010 0.013 | -0.707 0.435
## 5 | -0.462 0.168 0.231 | 0.110 0.010 0.013 | -0.707 0.435
## 6 | -0.357 0.100 0.027 | -0.078 0.005 0.001 | 0.344 0.103
## 7 | -0.462 0.168 0.231 | 0.110 0.010 0.013 | -0.707 0.435
## 8 | 0.782 0.481 0.216 | 0.593 0.294 0.124 | 0.071 0.004
## 9 | 0.497 0.195 0.083 | 0.222 0.041 0.016 | 0.544 0.257
## 10 | 1.104 0.961 0.443 | -0.678 0.384 0.167 | 0.138 0.016
## cos2
## 1 0.068 |
## 2 0.002 |
## 3 0.542 |
## 4 0.542 |
## 5 0.542 |
## 6 0.025 |
## 7 0.542 |
## 8 0.002 |
## 9 0.099 |
## 10 0.007 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 1.366 36.278 0.611 13.516 | -0.083 0.143 0.002
## Earl Grey | -0.440 9.794 0.348 -10.207 | 0.312 5.225 0.175
## green | -0.493 2.106 0.030 -2.996 | -1.636 24.612 0.331
## p_branded | -0.497 6.176 0.115 -5.856 | -0.218 1.257 0.022
## p_cheap | 1.116 2.289 0.030 2.982 | 1.313 3.361 0.041
## p_private label | 0.018 0.002 0.000 0.084 | -0.132 0.102 0.001
## p_unknown | 0.314 0.310 0.004 1.108 | 2.953 29.139 0.363
## p_upscale | 1.063 15.724 0.242 8.512 | -0.874 11.272 0.164
## p_variable | -0.188 1.036 0.021 -2.504 | 0.225 1.575 0.030
## alone | -0.275 3.870 0.140 -6.477 | -0.328 5.831 0.199
## v.test Dim.3 ctr cos2 v.test
## black -0.825 | 0.099 0.209 0.003 0.977 |
## Earl Grey 7.240 | -0.247 3.420 0.110 -5.741 |
## green -9.947 | 1.224 14.341 0.185 7.443 |
## p_branded -2.566 | 0.459 5.803 0.098 5.403 |
## p_cheap 3.509 | 0.315 0.201 0.002 0.841 |
## p_private label -0.626 | 1.039 6.569 0.081 4.928 |
## p_unknown 10.421 | 1.519 8.030 0.096 5.362 |
## p_upscale -6.999 | 0.310 1.476 0.021 2.483 |
## p_variable 2.999 | -0.913 27.080 0.497 -12.188 |
## alone -7.721 | -0.153 1.330 0.044 -3.614 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.611 0.359 0.207 |
## price | 0.324 0.559 0.565 |
## How | 0.334 0.279 0.378 |
##
## Supplementary categories (the 10 first)
## Dim.1 cos2 v.test Dim.2 cos2 v.test Dim.3 cos2
## F | -0.099 0.014 -2.071 | -0.001 0.000 -0.018 | -0.092 0.012
## M | 0.145 0.014 2.071 | 0.001 0.000 0.018 | 0.135 0.012
## employee | -0.153 0.006 -1.307 | -0.104 0.003 -0.891 | 0.027 0.000
## middle | 0.179 0.005 1.213 | 0.095 0.001 0.645 | 0.033 0.000
## non-worker | 0.281 0.021 2.531 | -0.214 0.012 -1.927 | -0.010 0.000
## other worker | 0.221 0.003 1.021 | 0.174 0.002 0.804 | 0.108 0.001
## senior | 0.226 0.007 1.419 | 0.086 0.001 0.543 | 0.009 0.000
## student | -0.350 0.037 -3.343 | 0.108 0.004 1.027 | -0.090 0.002
## workman | -0.327 0.004 -1.154 | 0.167 0.001 0.588 | 0.133 0.001
## +60 | 0.702 0.072 4.624 | -0.316 0.014 -2.078 | -0.033 0.000
## v.test
## F -1.930 |
## M 1.930 |
## employee 0.229 |
## middle 0.225 |
## non-worker -0.091 |
## other worker 0.498 |
## senior 0.056 |
## student -0.861 |
## workman 0.470 |
## +60 -0.218 |
##
## Supplementary categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## sex | 0.014 0.000 0.012 |
## SPC | 0.068 0.020 0.004 |
## age_Q | 0.130 0.022 0.014 |
## frequency | 0.011 0.007 0.026 |
Let’s start with a plot of the individuals (ind):
plot(mca, graph.type = "ggplot", invisible = c("var", "quali.sup", "quanti.sup"), cex = 0.8)
fig. 5.1 MCA “ind” variables, individuals
I see no real clusters, but there seem to be some duplicates. Dimension 1 covers 12.7% of the variance, and dimension 2 covers 12.0%, roughly the same. The scatter plot forms a triangle in this space, with a wide base forming around Dim 1 = 0 and extending to the right.
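The apparent duplicates can be verified directly (a quick sanity check, assuming `tea_filtered` from above): individuals with identical answers to the active variables get identical MCA coordinates and plot on top of each other.

```r
# Count rows that repeat an earlier combination of the active variables.
sum(duplicated(tea_filtered[, c("Tea", "price", "How")]))
```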
Now let’s plot the var variables — Tea,
price, and How — which the plot function for
FactoMineR’s MCA colors black, red, and green, respectively.
plot(mca, graph.type = "ggplot", invisible = c("ind", "quali.sup", "quanti.sup"), cex = 0.8, habillage = "quali")
fig. 5.2 MCA “var” variables, tea drinking habits
Dimension 1 seems to cover:

- most strongly (Tea: 0.61), whether the respondent prefers black tea or not (black +1.36, green -0.49, Earl Grey -0.44)
- then (How: 0.334), whether the respondent likes to add things to their tea (alone -0.28; lemon, milk, and other all positive)
- and finally, no clear correlation with price, since both upscale and cheap are positive in dimension 1

Dimension 2 seems to cover:

- whether the respondent is unlikely to prefer green tea (-1.636)
- whether the respondent likes cheap tea (cheap +1.313, upscale -0.874)
The p_unknown category in the price has the highest
coefficient here, which makes this one tough to interpret.
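One way to probe the puzzling p_unknown category (a quick check, not part of the assignment, assuming the `tea` data frame from above) is to cross-tabulate the price categories against tea type:

```r
# Rows: price categories; columns: preferred tea type. A skewed row for
# p_unknown would hint at what kind of respondents chose it.
table(tea$price, tea$Tea)
```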
Finally, let’s look at the demographics of the individuals and see how they distribute over this space:
plot(mca, graph.type = "ggplot", invisible = c("ind", "var", "quanti.sup"), cex = 0.8, habillage = "quali")
fig. 5.3 MCA “quali” variables, demographics
It seems that age is the best explaining variable for the differences in respondents’ tea drinking habits (dim 1: 0.130, dim 2: 0.022), followed by SPC, which encodes some of the same information as age, so perhaps it is redundant. If we interpret the dimensions from before, we could roughly say that:

- older people are more likely to prefer generic black tea to green tea or Earl Grey
- younger people are more likely to prefer Earl Grey
- green tea is most likely to be preferred by people in the 25-34 age range who are “employees” (office workers?)
- older people prefer to add milk, lemon, or something else to their tea
There is also a slight difference in the distribution of answers between the sexes:

- men are more likely to prefer black tea
- women are more likely to prefer Earl Grey or green tea
The frequency of tea drinking is all over the place and forms no clear trendline in the biplot.
Social equality trend line
It is good to point out at this point that there is a clear trend line in the upper left corner of the biplot, moving from the bottom right to the top left. Most of the countries lie on this trend line, while the rest are widely spread out, which is why the trend is not aligned with the principal components. Let’s call it the social equality trend line, as all the countries at the top of it have social welfare programs, such as universal healthcare, strong unions, and low Gini indices. (A bit hand-wavy, but this is a course assignment, not a research paper.)